CMPT 825: Natural Language Processing, Spring 2008

Authors

  • Anoop Sarkar
  • Mohsen Jamali
Abstract

Hidden Markov Models require a large set of parameters, which are induced from a text corpus. The parameters should be optimal in the sense that the resulting models assign high probabilities to the seen training data. There are several methods for estimating the model parameters. The first is to use each word as a state and estimate the probabilities by relative frequencies. The second method is a variation of the first: words are automatically grouped by similarity of distribution in the corpus, and each group is represented by a state in the model. This drastically reduces the number of model parameters and thereby mitigates the sparse-data problem. The third method uses manually defined categories. An important difference from the second method, with its automatically derived categories, is that with manual definition a word can belong to more than one category. The fourth method is a variation of the third and is also used for part-of-speech tagging; it does not need a pre-annotated corpus for parameter estimation, since the parameters are estimated with the Baum-Welch algorithm. This paper proposes a fifth method for estimating natural language models that combines the advantages of the methods mentioned above.
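As a minimal sketch of the first method (relative-frequency estimation with one state per word), the Python fragment below estimates bigram transition probabilities from a toy corpus; the corpus, function name, and boundary symbols are illustrative and not taken from the paper.

    from collections import defaultdict

    def relative_frequency_bigrams(sentences):
        # Estimate P(w_t | w_{t-1}) by relative frequency (maximum likelihood).
        pair_counts = defaultdict(int)   # counts of (previous word, current word)
        prev_counts = defaultdict(int)   # counts of the previous word as left context
        for sentence in sentences:
            tokens = ["<s>"] + sentence + ["</s>"]
            for prev, curr in zip(tokens, tokens[1:]):
                pair_counts[(prev, curr)] += 1
                prev_counts[prev] += 1
        return {pair: c / prev_counts[pair[0]] for pair, c in pair_counts.items()}

    corpus = [["the", "dog", "barks"], ["the", "cat", "sleeps"]]
    probs = relative_frequency_bigrams(corpus)
    print(probs[("the", "dog")])  # 0.5: "the" is followed by "dog" in one of two cases

With one state per word, such estimates become sparse as the vocabulary grows, which is exactly what the class-based second and third methods are meant to alleviate.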


Similar resources

CMPT 825: Natural Language Processing

1.1 Markov Processes. Consider a set of states S1, S2, . . . , SN. A discrete Markov process is one in which the system is in a particular state at any given time, and the state can change only at discrete intervals of time. We denote the time instants associated with state changes as t = 1, 2, . . . and the actual state at time t as qt. The current state in an n-order Markov proce...
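For reference, the first-order case of this definition is the standard Markov assumption, which can be written as

    P(q_t = S_j \mid q_{t-1} = S_i, q_{t-2} = S_k, \ldots) = P(q_t = S_j \mid q_{t-1} = S_i)

i.e. the next state depends only on the current state, not on the earlier history.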


CMPT 825: Natural Language Processing 12.0.1 Definitions

Tagger: a program that tags a word w_i in a text with its part-of-speech (POS) tag t_i. Precision: the number of correct responses out of the total number of responses. Recall: the number of correct responses out of the number correct in the key. Word features: features of a word that are used in characterizing the type of entity it is (i.e. whether a word is capitalized, whe...
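Written out explicitly, the standard formulas corresponding to the definitions above are

    \text{precision} = \frac{|\text{correct responses}|}{|\text{responses}|},
    \qquad
    \text{recall} = \frac{|\text{correct responses}|}{|\text{correct items in the key}|}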


CMPT 825: Natural Language Processing 1.1 Hiding a Semantic Hierarchy in a Markov Model [1] 1.1.1 General Concepts

We know that in logic a predicate is a relation between its arguments; in other words, a predicate defines constraints between its arguments. A predicate ρ(v, r, c) is called a selectional restriction, where v is a verb, r is a role or an object, and c is a class, which is a noun. A selectional preference σ : (v, r, c) → a is a function from these predicates to a real number, where a shows the de...
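As a toy illustration of how such a preference function might be represented, the sketch below maps (verb, role, class) triples to real-valued scores; the triples and scores are invented for illustration and do not come from the excerpt.

    # Hypothetical selectional-preference table: a higher score means the class
    # is a more plausible filler for the verb's role. All values are invented.
    sigma = {
        ("eat", "object", "food"): 0.9,
        ("eat", "object", "furniture"): 0.05,
    }

    def preference(verb, role, cls):
        # sigma maps a (verb, role, class) predicate to a real number a.
        return sigma.get((verb, role, cls), 0.0)

    print(preference("eat", "object", "food"))  # 0.9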


On Distributed Concurrent Multi-Port Router Test System

This paper presents a framework for a distributed concurrent multi-port-testing test system (CMPT-TS) for IP routers, under development at the Sichuan Network Communication Key Laboratory. Having analyzed the current state of concurrent testing for routers, the paper develops a distributed architecture for CMPT-TS and discusses its functional components in detail. Moreover, a new test definition langua...


Machine Reading

Over the last two decades or so, Natural Language Processing (NLP) has developed powerful methods for low-level syntactic and semantic text processing tasks such as parsing, semantic role labeling, and text categorization. Over the same period, the fields of machine learning and probabilistic reasoning have yielded important breakthroughs as well. It is now time to investigate how to leverage t...




Publication year: 2006